Linked Data Finland: A 7-star Model and Platform for Publishing and Re-using Linked Datasets
Abstract
The idea of Linked Data is to aggregate, harmonize, integrate, enrich, and publish data for re-use on the Web in a cost-efficient way using Semantic Web technologies. We address two major hindrances to re-using Linked Data: it is often difficult for a re-user to 1) understand the characteristics of the dataset and 2) evaluate the quality of the data for the intended purpose. This paper introduces the “Linked Data Finland” platform LDF.fi, which addresses these issues. We extend the well-known 5-star model of Tim Berners-Lee with a sixth star for providing the dataset with a schema that explains the dataset, and a seventh star for validating the data against the schema. LDF.fi also automates data publishing and provides data curation tools. The first prototype of the platform is available on the web as a service, hosting tens of datasets and supporting several applications.

1 Publishing Linked Data

Many Linked Data (LD) platforms have emerged on the Web since the publication of the four Linked Data publication principles and the 5-star model (http://www.w3.org/DesignIssues/LinkedData.html). For example, in Life Sciences alone there are LinkedLifeData (http://linkedlifedata.com/), NeuroCommons (http://neurocommons.org/), Chem2Bio2RDF (http://chem2bio2rdf.wikispaces.com/), HCLSIG/LODD (http://www.w3.org/wiki/HCLSIG/LODD), BioLOD (http://biolod.org/), and Bio2RDF (http://bio2rdf.org/). LDF.fi contributes to the current state of the art of Linked Data publishing [2] as follows:

1) We propose extending the 5-star model (http://5stardata.info/) into a 7-star model, with the goal of encouraging data publishers to provide their data with explicit metadata schemas and to validate their data for better quality.
2) LDF.fi automates the data publishing process so that not only a SPARQL endpoint but also a rich set of additional data services are generated automatically based on the metadata about the dataset and its graphs.
3) LDF.fi provides end users with additional tools and documentation for publishing, curating, and re-using the datasets.

This paper first explains these ideas and then presents the actual service available online (http://www.ldf.fi/). Our work is funded by Tekes and a consortium of 20 public organizations and companies.

2 7-star Linked Data

A major hindrance to re-using a dataset is the difficulty of evaluating how suitable the data is for the application purpose at hand. Datasets often use schemas (vocabularies) whose definitions or descriptions are not available separately but are embedded in the data itself. This makes it difficult to figure out the characteristics of the data. Furthermore, given the data and its schema, it may be difficult to say how well the data actually matches the schema; there are many data quality problems on the Semantic Web (http://pedantic-web.org/). To address these issues, we encourage data publishers with two extra stars:

– The 6th star is given if the schemas (vocabularies) used in the dataset are explicitly described and published alongside the dataset, unless the schemas are already available somewhere on the Web.
– For the 7th star, the quality of the dataset against the schemas used in it must be explicated, so that the user can evaluate whether the data quality matches her needs.

LDF.fi provides supporting tools related to these issues. First, schemas are documented automatically for the human reader by using a schema documentation generator. In our case, the LODE online service (http://www.essepuntato.it/lode) is employed. (Other possible tools for schema documentation include SpecGen, Neologism (http://neologism.deri.ie/), dowl (https://github.com/ldodds/dowl), Parrot (http://ontorule-project.eu/parrot/parrot), OWLDoc (http://code.google.com/p/co-ode-owl-plugins/wiki/OWLDoc), and OntologyBrowser (http://code.google.com/p/ontology-browser/).)

Second, in order to find out how schemas are actually used in a dataset, we created a new service, http://vocab.at [1]. It analyses a dataset, creates an HTML report that explains vocabulary usage in the data, and reports issues such as undefined properties or unresolvable namespaces. The input for vocab.at is either an RDF file, a SPARQL endpoint, or an HTML page with embedded RDFa markup.

3 Automatic Service Generation

LDF.fi tries to automate the process of publishing datasets as far as possible in the following way. The publisher is expected to create an RDF dataset with minimal metadata about it and its schemas. Here an extended version of the new W3C Service Description recommendation (http://www.w3.org/TR/sparql11-service-description/) and the VoID vocabulary (http://rdfs.org/ns/void) can be used, and the data is stored in the SPARQL endpoint. Alternatively, a simple JSON object listing the dataset and graph names, human-readable labels, and a description of the data can be provided. In the metadata, it is also possible to give an example URI pointing into the dataset, a SPARQL query example for querying the data, and optionally a link to possible visualizations of the dataset.

Based on such metadata, LDF.fi generates for each dataset a home page on which the following functionalities are available for re-users:

1. Links for downloading datasets and graphs are provided (if licensing permits it).
2. Schemas can be downloaded if provided with the data, and links to their documentation are provided (when available).
3. The following forms are created for inspecting the dataset in more detail:
   1) Given a URI, the corresponding RDF description can be read in various formats (Turtle, RDF/XML, RDF/JSON, N3, N-Triples) for human consumption in a browser. The example URI is used as a first choice to try out.
   2) Given a URI, Linked Data browsing can be started from it, with the example URI as a starting point.
4. There is a SPARQL query form for querying the service, with the given query used as a first example.
5. Links to vocab.at analysis reports of the graphs in the dataset are provided. They tell the end user what schemas (vocabularies) are used in the data and how they have been used. Data quality issues are pointed out.
6. SPARQL Service Descriptions of the datasets are provided, if available. LDF.fi uses the W3C SPARQL Service Description recommendation for this.
7. Links are provided to visualizations of the data that may give the re-user more insight into how the dataset can be used in applications.
8. Licensing conditions of the dataset are provided, as well as a label of 1–7 stars.

4 Data Curation Tools

Data curation refers to the activities and processes carried out to create, manage, maintain, and validate data. In LDF.fi several data curation services are available for analyzing textual data and for creating semantic annotations (semi-)automatically from them:

1. SeCo Lexical Analysis Services (http://demo.seco.tkk.fi/las/) can be used for language recognition, lemmatization, morphological analysis, inflected form generation, and hyphenation.
2. The ARPA Automatic Text Annotation System (http://www.seco.tkk.fi/services/arpa/) can be used for extracting Linked Data from unstructured texts.
3. The SAHA tool (http://www.seco.tkk.fi/tools/saha) can be used for investigating and editing LDF.fi datasets interactively in real time. In LDF.fi we modified and extended SAHA to work on top of any standard SPARQL endpoint. SAHA is now used as a Linked Data browser in LDF.fi in the same vein as, e.g., URIBurner (http://linkeddata.uriburner.com/). Using SAHA as an editor service for a dataset requires permission from the LDF.fi team.

In our work, we are also using some external tools, such as the SILK Framework for linking data.
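The two extra stars and the kind of vocabulary-usage check described for vocab.at can be illustrated with a small sketch. It is not the LDF.fi or vocab.at implementation: the function names, the representation of triples as plain tuples, the sample URIs, and the rule that the 6th and 7th stars only build on a full 5-star rating are all assumptions made for illustration.

```python
from collections import Counter


def property_usage(triples):
    """Count how often each predicate occurs in (subject, predicate, object) triples."""
    return Counter(predicate for _, predicate, _ in triples)


def undefined_properties(triples, schema_properties):
    """Return the predicates used in the data that the schema does not define."""
    used = set(property_usage(triples))
    return sorted(used - set(schema_properties))


def star_rating(base_stars, schema_published, quality_report_published):
    """Extend a 0-5 star rating with the 6th and 7th stars.

    Assumption: the extra stars are cumulative, so the 6th star requires
    5 stars and the 7th star requires the 6th.
    """
    if not 0 <= base_stars <= 5:
        raise ValueError("base_stars must be between 0 and 5")
    stars = base_stars
    if stars == 5 and schema_published:
        stars = 6  # schema is described and published with the dataset
    if stars == 6 and quality_report_published:
        stars = 7  # quality of the data against the schema is explicated
    return stars


# Hypothetical sample data: the last triple misspells "birthPlace".
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
BIRTH_PLACE = "http://example.org/schema#birthPlace"
sample_triples = [
    ("http://example.org/p1", FOAF_NAME, "Alice"),
    ("http://example.org/p1", BIRTH_PLACE, "Helsinki"),
    ("http://example.org/p2", FOAF_NAME, "Bob"),
    ("http://example.org/p2", "http://example.org/schema#brithPlace", "Espoo"),
]
sample_schema = {FOAF_NAME, BIRTH_PLACE}

print(property_usage(sample_triples))
print(undefined_properties(sample_triples, sample_schema))
print(star_rating(5, schema_published=True, quality_report_published=True))
```

In this sketch the 7th star depends on whether a quality report is published, not on the data being free of issues, which follows the wording of the 7th-star criterion in Section 2.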
Publication date: 2014